Predicting Missing Attribute Values based on Frequent Itemset and RSFit
Abstract
Handling missing attribute values is an important data preprocessing problem in data mining and knowledge discovery. A common but naive solution is to ignore instances that contain missing attribute values. This discards information and can easily throw away a significant amount of data. Other methods, such as assigning the most common value or an average value to the missing attribute, make use of all the available data. However, the assigned value may not reflect the information from which the data were originally derived, so it introduces noise. We introduce ItemRSFit, an integrated approach that predicts missing attribute values by combining the frequent itemset and RSFit approaches. Frequent itemsets are generated by an association rule algorithm and capture the correlations between items in a transaction data set. Using frequent itemsets as a knowledge base to predict missing attribute values is shown to give high prediction accuracy, but this approach alone cannot predict every missing value. RSFit is a recently developed approach that predicts missing attribute values from the similarity of attribute-value pairs, considering only the attributes contained in the core or a reduct of the data set. RSFit provides faster prediction and can be used for the cases not covered by the itemset approach. Empirical studies on UCI data sets and a real-world data set demonstrate a significant increase in prediction accuracy with the integrated approach.
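The two-stage idea described in the abstract (try frequent itemsets first, then fall back to a similarity match restricted to a subset of attributes) can be illustrated with a minimal sketch. This is not the authors' ItemRSFit implementation. The function names, the `min_support` and `max_size` parameters, the brute-force itemset counting, and the simple "count of matching attribute-value pairs" similarity are all assumptions made for illustration. The `reduct` is assumed to be given rather than computed with rough set methods.

```python
from collections import Counter
from itertools import combinations

MISSING = None  # assumption: missing values are encoded as None


def frequent_itemsets(rows, min_support, max_size=3):
    """Brute-force count of attribute-value itemsets (size 2..max_size)."""
    counts = Counter()
    for row in rows:
        items = sorted((a, v) for a, v in row.items() if v is not MISSING)
        for k in range(2, max_size + 1):
            for combo in combinations(items, k):
                counts[frozenset(combo)] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


def predict_from_itemsets(row, attr, itemsets):
    """Use the largest, best-supported itemset whose other items match the row."""
    known = {(a, v) for a, v in row.items() if v is not MISSING}
    best = None  # (itemset size, support, predicted value)
    for itemset, support in itemsets.items():
        target = [v for a, v in itemset if a == attr]
        if not target:
            continue  # this itemset says nothing about attr
        rest = {(a, v) for a, v in itemset if a != attr}
        if rest <= known:
            cand = (len(itemset), support, target[0])
            if best is None or cand[:2] > best[:2]:
                best = cand
    return None if best is None else best[2]


def predict_from_similar_row(row, attr, rows, reduct):
    """Fallback: copy attr from the row sharing the most reduct values."""
    best_value, best_score = None, -1
    for other in rows:
        if other.get(attr) is MISSING:
            continue
        score = sum(
            1 for a in reduct
            if a != attr and row.get(a) is not MISSING and row.get(a) == other.get(a)
        )
        if score > best_score:
            best_value, best_score = other[attr], score
    return best_value


def impute(rows, reduct, min_support=2):
    """Return copies of rows with missing values filled; inputs are untouched."""
    complete = [r for r in rows if all(v is not MISSING for v in r.values())]
    itemsets = frequent_itemsets(complete, min_support)
    filled = []
    for row in rows:
        new_row = dict(row)
        for attr, value in row.items():
            if value is MISSING:
                guess = predict_from_itemsets(row, attr, itemsets)
                if guess is None:  # not covered by any frequent itemset
                    guess = predict_from_similar_row(row, attr, complete, reduct)
                new_row[attr] = guess
        filled.append(new_row)
    return filled


# Toy data, invented for this example only.
rows = [
    {"outlook": "sunny", "wind": "weak", "play": "yes"},
    {"outlook": "sunny", "wind": "weak", "play": "yes"},
    {"outlook": "rain", "wind": "strong", "play": "no"},
    {"outlook": "sunny", "wind": "weak", "play": None},
    {"outlook": "rain", "wind": "strong", "play": None},
]
filled = impute(rows, reduct=["outlook", "wind"], min_support=2)
print(filled[3]["play"], filled[4]["play"])  # prints: yes no
```

In this toy run the fourth row is covered by a frequent itemset, while the fifth row matches no itemset with sufficient support and is filled by the similarity fallback, which mirrors the division of labor the abstract describes.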
Similar Resources
Comparisons on Different Approaches to Assign Missing Attribute Values
A commonly used but naive solution for data with missing attribute values is to ignore the instances that contain them. This method may neglect important information within the data, a significant amount of data could easily be discarded, and the discovered knowledge may not contain significant rules. Some methods, such as assigning the most common values or assigning ...
A Novel Algorithm for Association Rule Mining from Data with Incomplete and Missing Values
Missing values and incomplete data are a natural phenomenon in real datasets. If association rules are mined from incomplete data without regard for missing values, mistaken rules are derived. In association rule mining, the treatment of missing values and incomplete data is therefore important. This paper proposes a novel technique to mine association rules from data with missing values in large, voluminous databases. The p...
Predicting Missing Attribute Values Using k-Means Clustering
Problem statement: Predicting the value of missing attributes is an important data preprocessing problem in data mining and knowledge discovery tasks. Several methods have been proposed to treat missing data, and the one used most frequently is deleting instances that contain at least one missing feature value. When the dataset has a minimal number of missing attribute values, then we can negle...
A New Algorithm for High Average-utility Itemset Mining
High utility itemset mining (HUIM) is an emerging field in data mining that has gained growing interest due to its various applications. The goal of this problem is to discover all itemsets whose utility exceeds a minimum threshold. The basic HUIM problem does not consider the length of itemsets in its utility measurement, and utility values tend to become higher for itemsets containing more items...
Estimating Missing Data in Data Streams
Networks of thousands of sensors present a feasible and economic solution to some of our most challenging problems, such as real-time traffic modeling, military sensing and tracking. Many research projects have been conducted by different organizations regarding wireless sensor networks; however, few of them discuss how to estimate missing sensor data. In this research we present a novel data e...